Load the required packages.
options(stringsAsFactors = FALSE)
library(tidyverse)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(fastDummies)
library(scales)
library(glmnet)
library(zoo)
library(ISLR)
library(leaps)
library(rpart)
library(rpart.plot)
library(tree)
library(randomForest)
library(gbm)
# Set the ggplot theme beforehand for all plots
theme_set(theme_clean())
This notebook was created by Cohort A Team 7 for the BA 810 team project. It describes our business setting, an existing dataset, and our attempts at training machine learning models.
Our team is planning to open a store called “Everything Avocado”, selling avocado food products such as avocado toast, tortilla chips with guacamole, and avocado smoothies. While preparing to open, we need to set a budget for buying raw avocados as inventory, so we would like to forecast future avocado prices.
Fortunately, all of the team members are taking a machine learning course together, so we decided to develop a model to predict avocado prices using an existing data source.
We expect that our best model could be applied to other raw-product retail settings, especially fresh produce (fruits and vegetables). Ideally, the price of any product of this type could be predicted, provided its price is influenced by similar predictors. Anyone who wants a sense of future product prices (retailers, customers, market analysts, etc.) would have a strong interest in such a model.
The dataset we use is the Avocado Prices dataset from Kaggle, a record of historical avocado prices and sales volumes in multiple US markets.
ds <- read_csv("avocado.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  Date = col_date(format = ""),
  AveragePrice = col_double(),
  `Total Volume` = col_double(),
  `4046` = col_double(),
  `4225` = col_double(),
  `4770` = col_double(),
  `Total Bags` = col_double(),
  `Small Bags` = col_double(),
  `Large Bags` = col_double(),
  `XLarge Bags` = col_double(),
  type = col_character(),
  year = col_double(),
  region = col_character()
)
glimpse(ds)
Observations: 18,249
Variables: 14
$ X1             <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
$ Date           <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015…
$ AveragePrice   <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99,…
$ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, …
$ `4046`         <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48,…
$ `4225`         <dbl> 54454.85, 44638.81, 109149.67, 71976.41, …
$ `4770`         <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61…
$ `Total Bags`   <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.…
$ `Small Bags`   <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.…
$ `Large Bags`   <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127…
$ `XLarge Bags`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ type           <chr> "conventional", "conventional", "conventi…
$ year           <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015,…
$ region         <chr> "Albany", "Albany", "Albany", "Albany", "…
Here is a quick view of the columns and data types. There are 14 columns in this dataset, and AveragePrice will be our prediction target. X1 gives an id for each observation, and we will replace it with a row number for each row. Most of the columns are <dbl>; only 2 columns, type and region, hold character vectors. We will convert them to dummy variables to train our models. The time span is from January 4, 2015 to March 25, 2018.
Before modeling, we plotted the average avocado price grouped by date and observed that the average price usually increases during the second half of the year.
ds %>%
group_by(Date) %>%
summarise(meanpriced = mean(AveragePrice)) %>%
ggplot(aes(x = Date, y = meanpriced)) +
geom_point(col = "#4a7337") +
geom_smooth(col = "#d9cd65", se = FALSE) +
labs(y = "Average price",
title = "Average price by date") +
theme_bw()
We also plotted the total quantity sold (volume) by date. We noticed that at the beginning of each year there is one day when the volume sold is extremely high. We checked the calendar and found that these days were all Super Bowl game days!
ds %>%
group_by(Date) %>%
summarise(totalvold = sum(`Total Volume`)) %>%
ggplot(aes(x = Date, y = totalvold)) +
geom_point(col = "#4a7337") +
geom_smooth(col = "#d9cd65", se = FALSE) +
labs(y = "Total volume sold",
title = "Total volume sold by date") +
theme_bw()
In the following code chunk, we took multiple steps to arrange the columns.
# Reorder the columns and sort the rows by date
ds <- ds %>%
group_by(type, region) %>%
select(X1, year, Date, type, region, everything()) %>%
arrange(Date)
# Change `X1` to `ID`
colName <- names(ds)
colName[1] <- "ID"
names(ds) <- colName
# Assign distinct number id to each observation in `ID` column
ds$ID <- seq(nrow(ds))
# Extract month from `Date` column
ds$month <- month(ds$Date)
ds <- ds %>%
select(ID, year, month, everything())
There are only 2 types of avocados recorded, conventional and organic. We decided to expand the type column into type_conventional and type_organic dummy columns.
dsNew <- dummy_cols(ds, select_columns = "type") %>%
select(ID, year, month, region, type_conventional, type_organic,
everything(), -type)
Price Look-Up (PLU) codes are used by grocery retailers to distinguish product inventories. This dataset includes 3 avocado PLUs, and we added a new column other_PLU for the volume sold in products without a PLU code, calculated by subtracting the PLU volumes from the total volume.
dsNew$other_PLU <- dsNew$`Total Volume` - dsNew$`4046` - dsNew$`4225` - dsNew$`4770`
# Reorder the columns
dsNew <- dsNew %>%
select(1:3, Date, everything())
We initially included other_PLU as a predictor in our models, but then we found that other_PLU is equal to Total Bags. This equivalence resolved our confusion about the bag variables: the volume sold without a PLU code is exactly the volume sold in bags.
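As a quick sanity check of this equivalence (a minimal sketch; all.equal tolerates small floating-point differences):

```r
# Non-PLU volume should match the bagged volume
all.equal(dsNew$other_PLU, dsNew$`Total Bags`)
```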
Categorize region into US Areas
uniqueRegion <- unique(dsNew$region)
uniqueRegion <- as.data.frame(uniqueRegion)
uniqueRegion$Area <- NA
uniqueRegion$Area[1] <- "NewEngland"
uniqueRegion$Area[2] <- "Southeast"
uniqueRegion$Area[3] <- "Mideast"
uniqueRegion$Area[4] <- "RockyMountain"
uniqueRegion$Area[5] <- "NewEngland"
uniqueRegion$Area[6] <- "Mideast"
uniqueRegion$Area[7] <- "FarWest"
uniqueRegion$Area[8] <- "Southeast"
uniqueRegion$Area[9] <- "GreatLakes"
uniqueRegion$Area[10] <- "GreatLakes"
uniqueRegion$Area[11] <- "GreatLakes"
uniqueRegion$Area[12] <- "Southwest"
uniqueRegion$Area[13] <- "RockyMountain"
uniqueRegion$Area[14] <- "GreatLakes"
uniqueRegion$Area[15] <- "GreatLakes"
uniqueRegion$Area[16] <- "GreatLakes"
uniqueRegion$Area[17] <- "Mideast"
uniqueRegion$Area[18] <- "NewEngland"
uniqueRegion$Area[19] <- "Southeast"
uniqueRegion$Area[20] <- "GreatLakes"
uniqueRegion$Area[21] <- "Southeast"
uniqueRegion$Area[22] <- "FarWest"
uniqueRegion$Area[23] <- "FarWest"
uniqueRegion$Area[24] <- "Southeast"
uniqueRegion$Area[25] <- "Southeast"
uniqueRegion$Area[26] <- "Southeast"
uniqueRegion$Area[27] <- "Southeast"
uniqueRegion$Area[28] <- "Southeast"
uniqueRegion$Area[29] <- "Mideast"
uniqueRegion$Area[30] <- "NewEngland"
uniqueRegion$Area[31] <- "NewEngland"
uniqueRegion$Area[32] <- "Southeast"
uniqueRegion$Area[33] <- "Mideast"
uniqueRegion$Area[34] <- "Southwest"
uniqueRegion$Area[35] <- "Mideast"
uniqueRegion$Area[36] <- "Plains"
uniqueRegion$Area[37] <- "FarWest"
uniqueRegion$Area[38] <- "Southeast"
uniqueRegion$Area[39] <- "Southeast"
uniqueRegion$Area[40] <- "Southeast"
uniqueRegion$Area[41] <- "FarWest"
uniqueRegion$Area[42] <- "FarWest"
uniqueRegion$Area[43] <- "FarWest"
uniqueRegion$Area[44] <- "FarWest"
uniqueRegion$Area[45] <- "Southeast"
uniqueRegion$Area[46] <- "Southeast"
uniqueRegion$Area[47] <- "Southeast"
uniqueRegion$Area[48] <- "FarWest"
uniqueRegion$Area[49] <- "Plains"
uniqueRegion$Area[50] <- "Mideast"
uniqueRegion$Area[51] <- "Southeast"
uniqueRegion$Area[52] <- "TotalUS"
uniqueRegion$Area[53] <- "FarWest"
uniqueRegion$Area[54] <- "Southwest"
names(uniqueRegion)[1] <- "region"
avo <- dsNew %>%
left_join(uniqueRegion, by = "region") %>%
select(1:5, Area, everything())
avo <- dummy_cols(avo, select_columns = "Area")
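The element-by-element Area assignment above could also be written as a lookup table joined onto the data, which is easier to audit. A sketch, showing only the first few regions (the full table would mirror the assignments above):

```r
# Hypothetical compact alternative: map regions to areas via a join
areaMap <- tribble(
  ~region,               ~Area,
  "Albany",              "NewEngland",
  "Atlanta",             "Southeast",
  "BaltimoreWashington", "Mideast"
  # ... remaining regions ...
)
# dsNew %>% left_join(areaMap, by = "region")
```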
Rename Column Names
colnames(avo)
[1] "ID" "year" "month"
[4] "Date" "region" "Area"
[7] "type_conventional" "type_organic" "AveragePrice"
[10] "Total Volume" "4046" "4225"
[13] "4770" "Total Bags" "Small Bags"
[16] "Large Bags" "XLarge Bags" "other_PLU"
[19] "Area_NewEngland" "Area_Southeast" "Area_Mideast"
[22] "Area_RockyMountain" "Area_FarWest" "Area_GreatLakes"
[25] "Area_Southwest" "Area_Plains" "Area_TotalUS"
names(avo)[10] <- "TotalVolume"
names(avo)[14] <- "TotalBags"
names(avo)[15] <- "SmallBags"
names(avo)[16] <- "LargeBags"
names(avo)[17] <- "XLargeBags"
names(avo)[11] <- "PLU4046"
names(avo)[12] <- "PLU4225"
names(avo)[13] <- "PLU4770"
set.seed(1234)
avo_train <- avo %>% filter(as.Date(Date) < "2017-03-01")
avo_train %>%
  filter(year == 2017, month == 2)   # check the last month in the training set
avo_test <- avo %>% filter(as.Date(Date) >= "2017-03-01")
avo_test %>%
  filter(year == 2018, month == 3)   # check the last month in the test set
Base model
f1 <- as.formula(AveragePrice ~ month + type_conventional + type_organic +
  TotalVolume + PLU4046 + PLU4770 + PLU4225 +
  SmallBags + LargeBags + XLargeBags +
  Area_NewEngland + Area_Southeast + Area_Mideast + Area_RockyMountain +
  Area_FarWest + Area_GreatLakes + Area_Southwest + Area_Plains + Area_TotalUS)
x1_train <- model.matrix(f1,avo_train)[,-1]
y1_train <- avo_train$AveragePrice
x1_test <- model.matrix(f1,avo_test)[,-1]
y1_test <- avo_test$AveragePrice
x1_avo <- model.matrix(f1, avo)[,-1]
fit.lm <- lm(f1, avo_train)
fit.lm
Call:
lm(formula = f1, data = avo_train)
Coefficients:
(Intercept) month type_conventional
1.525491 0.015248 -0.517023
type_organic TotalVolume PLU4046
NA -0.229128 0.229128
PLU4770 PLU4225 SmallBags
0.229128 0.229128 0.229128
LargeBags XLargeBags Area_NewEngland
0.229128 0.229129 0.194460
Area_Southeast Area_Mideast Area_RockyMountain
-0.062017 0.167542 -0.166883
Area_FarWest Area_GreatLakes Area_Southwest
-0.007661 -0.065708 -0.198306
Area_Plains Area_TotalUS
0.009257 NA
y.train <- avo_train$AveragePrice
y.test <- avo_test$AveragePrice
yhat.train.lm <- predict(fit.lm)
mse.train.lm <- mean((y.train - yhat.train.lm)^2)
yhat.test.lm <- predict(fit.lm, avo_test)
prediction from a rank-deficient fit may be misleading
mse.test.lm <- mean((y.test - yhat.test.lm)^2)
mse.train.lm
[1] 0.06454066
mse.test.lm
[1] 387.6257
coef(fit.lm)
(Intercept) month type_conventional
1.525490825 0.015248443 -0.517023414
type_organic TotalVolume PLU4046
NA -0.229128378 0.229128340
PLU4770 PLU4225 SmallBags
0.229128450 0.229128396 0.229128404
LargeBags XLargeBags Area_NewEngland
0.229128254 0.229128567 0.194459980
Area_Southeast Area_Mideast Area_RockyMountain
-0.062016571 0.167542328 -0.166883258
Area_FarWest Area_GreatLakes Area_Southwest
-0.007660860 -0.065707641 -0.198305711
Area_Plains Area_TotalUS
0.009257048 NA
plot(fit.lm)
avo1 <- avo
avo1$ID <- NULL
avo1$year<-NULL
avo1$Date<-NULL
avo1$region<-NULL
avo1$Area<-NULL
avo1$AveragePrice<-NULL
xnames <- colnames(avo1)
xnames <- xnames[!xnames %in% c("type_conventional", "type_organic", "TotalVolume", "PLU4046", "PLU4770", "PLU4225", "SmallBags", "LargeBags", "XLargeBags", "Area_NewEngland", "Area_Southeast", "Area_Mideast", "Area_RockyMountain", "Area_FarWest", "Area_GreatLakes", "Area_Southwest", "Area_Plains", "Area_TotalUS")]
fit_fw <- lm(AveragePrice ~ 1, data = avo_train)
yhat_train <- predict(fit_fw, avo_train)
mse_train <- mean((avo_train$AveragePrice - yhat_train) ^ 2)
yhat_test <- predict(fit_fw, avo_test)
mse_test <- mean((avo_test$AveragePrice - yhat_test) ^ 2)
log_fw <-
tibble(
xname = "intercept",
model = deparse(fit_fw$call),
mse_train = mse_train,
mse_test = mse_test
)
best_mse_train <- NA
best_mse_test <- NA
best_fit_fw <- NA
best_xname <- NA
for (xname in xnames) {
# take a moment to examine and understand the following line
fit_fw_tmp <- update(fit_fw, as.formula(paste0(" ~ ", xname)))
# compute MSE train
yhat_train_tmp <- predict(fit_fw_tmp, avo_train)
mse_train_tmp <- mean((avo_train$AveragePrice - yhat_train_tmp) ^ 2)
# compute MSE test
yhat_test_tmp <- predict(fit_fw_tmp, avo_test)
mse_test_tmp <- mean((avo_test$AveragePrice - yhat_test_tmp) ^ 2)
# if this is the first predictor to be examined,
# or if this predictor yields a lower MSE than the current
# best, then store this predictor as the current best predictor
if (is.na(best_mse_test) | mse_test_tmp < best_mse_test) {
best_xname <- xname
best_fit_fw <- fit_fw_tmp
best_mse_train <- mse_train_tmp
best_mse_test <- mse_test_tmp
}
}
best_mse_train
[1] 0.147114
best_mse_test
[1] 0.1893665
log_fw <-
log_fw %>% add_row(
xname = best_xname,
model = paste0(deparse(best_fit_fw$call), collapse = ""),
mse_train = best_mse_train,
mse_test = best_mse_test
)
log_fw
# Here is the complete forward stepwise selection loop
xnames<- colnames(avo1)
xnames <- xnames[!xnames %in% c("type_conventional", "type_organic", "TotalVolume", "PLU4046", "PLU4770", "PLU4225", "SmallBags", "LargeBags", "XLargeBags", "Area_NewEngland", "Area_Southeast", "Area_Mideast", "Area_RockyMountain", "Area_FarWest", "Area_GreatLakes", "Area_Southwest", "Area_Plains", "Area_TotalUS")]
fit_fw <- lm(AveragePrice ~ 1, data = avo_train)
yhat_train <- predict(fit_fw, avo_train)
yhat_test <- predict(fit_fw, avo_test)
mse_train <- mean((avo_train$AveragePrice - yhat_train)^2)
mse_test <- mean((avo_test$AveragePrice - yhat_test)^2)
xname <- "intercept"
log_fw <-
tibble(
xname = xname,
model = paste0(deparse(fit_fw$call), collapse = ""),
mse_train = mse_train,
mse_test = mse_test
)
while (length(xnames) > 0) {
best_mse_train <- NA
best_mse_test <- NA
best_fit_fw <- NA
best_xname <- NA
# select the next best predictor
for (xname in xnames) {
# take a moment to examine and understand the following line
fit_fw_tmp <- update(fit_fw, as.formula(paste0(". ~ . + ", xname)))
# compute MSE train
yhat_train_tmp <- predict(fit_fw_tmp, avo_train)
mse_train_tmp <- mean((avo_train$AveragePrice - yhat_train_tmp) ^ 2)
# compute MSE test
yhat_test_tmp <- predict(fit_fw_tmp, avo_test)
mse_test_tmp <- mean((avo_test$AveragePrice - yhat_test_tmp) ^ 2)
# if this is the first predictor to be examined,
# or if this predictor yields a lower MSE than the current
# best, then store this predictor as the current best predictor
if (is.na(best_mse_test) | mse_test_tmp < best_mse_test) {
best_xname <- xname
best_fit_fw <- fit_fw_tmp
best_mse_train <- mse_train_tmp
best_mse_test <- mse_test_tmp
}
}
log_fw <-
log_fw %>% add_row(
xname = best_xname,
model = paste0(deparse(best_fit_fw$call), collapse = ""),
mse_train = best_mse_train,
mse_test = best_mse_test
)
# adopt the best model for the next iteration
fit_fw <- best_fit_fw
# remove the current best predictor from the list of predictors
xnames <- xnames[xnames!=best_xname]
}
prediction from a rank-deficient fit may be misleading
ggplot(log_fw, aes(seq_along(xname), mse_test)) +
geom_point() +
geom_line() +
geom_point(aes(y=mse_train), color="#AA471F") +
geom_line(aes(y=mse_train), color="#AA471F") +
scale_x_continuous("Variables", labels = log_fw$xname, breaks = seq_along(log_fw$xname)) +
scale_y_continuous("MSE test") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
fw<- as.formula(AveragePrice ~ month + TotalBags + Area_TotalUS + Area_Plains + other_PLU)
fit.fw<-lm(fw, data = avo_train)
## calculate the yhat price for the avo dataset
yhat_avo_fw <- predict(fit.fw, avo)
prediction from a rank-deficient fit may be misleading
df1_fw<-avo %>%
select(Date, AveragePrice)
df2_fw<-cbind(df1_fw, yhat_avo_fw)
colnames(df2_fw)
[1] "Date" "AveragePrice" "yhat_avo_fw"
names(df2_fw)[3]<-"AveragePrice_hat"
## Plot the actual average price and the predictive average price
plot1_fw <- df2_fw %>%
group_by(Date) %>%
summarize(
MeanAvg=mean(AveragePrice),
MeanAvg_hat=mean(AveragePrice_hat))%>%
ggplot()+
geom_line(aes(Date, MeanAvg),color = "#356211")+
geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
theme_classic()
plot1_fw + geom_vline(aes(xintercept = as.numeric(Date[113])),
linetype = "dashed", size = 1,
color = "#AA471F")
We fit the f1 formula, regressing average price on all predictors with backward selection, to find the best model for each given number of predictors.
regfit.bwd = regsubsets(f1, data = avo_train, nvmax = 19, method = "backward")
2 linear dependencies found
Reordering variables and trying again:
number of items to replace is not a multiple of replacement length
summary(regfit.bwd)
Subset selection object
Call: regsubsets.formula(f1, data = avo_train, nvmax = 19, method = "backward")
19 Variables (and intercept)
Forced in Forced out
month FALSE FALSE
type_conventional FALSE FALSE
TotalVolume FALSE FALSE
PLU4046 FALSE FALSE
PLU4770 FALSE FALSE
PLU4225 FALSE FALSE
SmallBags FALSE FALSE
LargeBags FALSE FALSE
XLargeBags FALSE FALSE
Area_NewEngland FALSE FALSE
Area_Southeast FALSE FALSE
Area_Mideast FALSE FALSE
Area_RockyMountain FALSE FALSE
Area_FarWest FALSE FALSE
Area_GreatLakes FALSE FALSE
Area_Southwest FALSE FALSE
Area_Plains FALSE FALSE
type_organic FALSE FALSE
Area_TotalUS FALSE FALSE
1 subsets of each size up to 17
Selection Algorithm: backward
month type_conventional type_organic TotalVolume PLU4046
1 ( 1 ) " " "*" " " " " " "
2 ( 1 ) " " "*" " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) "*" "*" " " " " " "
5 ( 1 ) "*" "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " "*" " "
9 ( 1 ) "*" "*" " " "*" " "
10 ( 1 ) "*" "*" " " "*" " "
11 ( 1 ) "*" "*" " " "*" " "
12 ( 1 ) "*" "*" " " "*" " "
13 ( 1 ) "*" "*" " " "*" "*"
14 ( 1 ) "*" "*" " " "*" "*"
15 ( 1 ) "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" " " "*" "*"
PLU4770 PLU4225 SmallBags LargeBags XLargeBags
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " " "
4 ( 1 ) " " " " " " " " " "
5 ( 1 ) " " " " " " " " " "
6 ( 1 ) " " " " " " " " " "
7 ( 1 ) " " " " " " " " " "
8 ( 1 ) " " " " " " " " " "
9 ( 1 ) " " " " " " " " " "
10 ( 1 ) " " "*" " " " " " "
11 ( 1 ) " " "*" "*" " " " "
12 ( 1 ) "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" " " " "
14 ( 1 ) "*" "*" "*" " " "*"
15 ( 1 ) "*" "*" "*" "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*"
Area_NewEngland Area_Southeast Area_Mideast
1 ( 1 ) " " " " " "
2 ( 1 ) " " " " "*"
3 ( 1 ) "*" " " "*"
4 ( 1 ) "*" " " "*"
5 ( 1 ) "*" " " "*"
6 ( 1 ) "*" " " "*"
7 ( 1 ) "*" "*" "*"
8 ( 1 ) "*" "*" "*"
9 ( 1 ) "*" "*" "*"
10 ( 1 ) "*" "*" "*"
11 ( 1 ) "*" "*" "*"
12 ( 1 ) "*" "*" "*"
13 ( 1 ) "*" "*" "*"
14 ( 1 ) "*" "*" "*"
15 ( 1 ) "*" "*" "*"
16 ( 1 ) "*" "*" "*"
17 ( 1 ) "*" "*" "*"
Area_RockyMountain Area_FarWest Area_GreatLakes
1 ( 1 ) " " " " " "
2 ( 1 ) " " " " " "
3 ( 1 ) " " " " " "
4 ( 1 ) " " " " " "
5 ( 1 ) " " " " " "
6 ( 1 ) "*" " " " "
7 ( 1 ) "*" " " " "
8 ( 1 ) "*" " " " "
9 ( 1 ) "*" " " "*"
10 ( 1 ) "*" " " "*"
11 ( 1 ) "*" " " "*"
12 ( 1 ) "*" " " "*"
13 ( 1 ) "*" " " "*"
14 ( 1 ) "*" " " "*"
15 ( 1 ) "*" " " "*"
16 ( 1 ) "*" " " "*"
17 ( 1 ) "*" "*" "*"
Area_Southwest Area_Plains Area_TotalUS
1 ( 1 ) " " " " " "
2 ( 1 ) " " " " " "
3 ( 1 ) " " " " " "
4 ( 1 ) " " " " " "
5 ( 1 ) "*" " " " "
6 ( 1 ) "*" " " " "
7 ( 1 ) "*" " " " "
8 ( 1 ) "*" " " " "
9 ( 1 ) "*" " " " "
10 ( 1 ) "*" " " " "
11 ( 1 ) "*" " " " "
12 ( 1 ) "*" " " " "
13 ( 1 ) "*" " " " "
14 ( 1 ) "*" " " " "
15 ( 1 ) "*" " " " "
16 ( 1 ) "*" "*" " "
17 ( 1 ) "*" "*" " "
Calculate the predictions and MSEs for the train and test datasets. The test MSE is far too large for the model including all the predictors.
sub_model<-lm(f1, data = avo_train)
yhat_train_stepwise <- predict(sub_model, avo_train)
prediction from a rank-deficient fit may be misleading
MSE_train_stepwise <- mean((avo_train$AveragePrice - yhat_train_stepwise)^2)
MSE_train_stepwise
[1] 0.06454066
yhat_test_stepwise <- predict(sub_model, avo_test)
prediction from a rank-deficient fit may be misleading
MSE_test_stepwise <- mean((avo_test$AveragePrice - yhat_test_stepwise)^2)
MSE_test_stepwise
[1] 387.6257
Based on the summary, the best-fitting model contains everything except Area_TotalUS and type_organic, so we excluded these two predictors and created a new formula to train the model.
f1_1 <- as.formula(AveragePrice ~ month + type_conventional + TotalVolume +
  PLU4046 + PLU4770 + PLU4225 + SmallBags + LargeBags + XLargeBags +
  Area_NewEngland + Area_Southeast + Area_Mideast + Area_RockyMountain +
  Area_FarWest + Area_GreatLakes + Area_Southwest + Area_Plains)
regfit.bwd1 = regsubsets(f1_1, data = avo_train, nvmax = 19, method = "backward")
summary(regfit.bwd1)
Subset selection object
Call: regsubsets.formula(f1_1, data = avo_train, nvmax = 19, method = "backward")
17 Variables (and intercept)
Forced in Forced out
month FALSE FALSE
type_conventional FALSE FALSE
TotalVolume FALSE FALSE
PLU4046 FALSE FALSE
PLU4770 FALSE FALSE
PLU4225 FALSE FALSE
SmallBags FALSE FALSE
LargeBags FALSE FALSE
XLargeBags FALSE FALSE
Area_NewEngland FALSE FALSE
Area_Southeast FALSE FALSE
Area_Mideast FALSE FALSE
Area_RockyMountain FALSE FALSE
Area_FarWest FALSE FALSE
Area_GreatLakes FALSE FALSE
Area_Southwest FALSE FALSE
Area_Plains FALSE FALSE
1 subsets of each size up to 17
Selection Algorithm: backward
month type_conventional TotalVolume PLU4046 PLU4770
1 ( 1 ) " " "*" " " " " " "
2 ( 1 ) " " "*" " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) "*" "*" " " " " " "
5 ( 1 ) "*" "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " "
8 ( 1 ) "*" "*" "*" " " " "
9 ( 1 ) "*" "*" "*" " " " "
10 ( 1 ) "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" " " "*"
13 ( 1 ) "*" "*" "*" "*" "*"
14 ( 1 ) "*" "*" "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*"
PLU4225 SmallBags LargeBags XLargeBags Area_NewEngland
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " "*"
4 ( 1 ) " " " " " " " " "*"
5 ( 1 ) " " " " " " " " "*"
6 ( 1 ) " " " " " " " " "*"
7 ( 1 ) " " " " " " " " "*"
8 ( 1 ) " " " " " " " " "*"
9 ( 1 ) " " " " " " " " "*"
10 ( 1 ) "*" " " " " " " "*"
11 ( 1 ) "*" "*" " " " " "*"
12 ( 1 ) "*" "*" " " " " "*"
13 ( 1 ) "*" "*" " " " " "*"
14 ( 1 ) "*" "*" " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*"
Area_Southeast Area_Mideast Area_RockyMountain
1 ( 1 ) " " " " " "
2 ( 1 ) " " "*" " "
3 ( 1 ) " " "*" " "
4 ( 1 ) " " "*" " "
5 ( 1 ) " " "*" " "
6 ( 1 ) " " "*" "*"
7 ( 1 ) "*" "*" "*"
8 ( 1 ) "*" "*" "*"
9 ( 1 ) "*" "*" "*"
10 ( 1 ) "*" "*" "*"
11 ( 1 ) "*" "*" "*"
12 ( 1 ) "*" "*" "*"
13 ( 1 ) "*" "*" "*"
14 ( 1 ) "*" "*" "*"
15 ( 1 ) "*" "*" "*"
16 ( 1 ) "*" "*" "*"
17 ( 1 ) "*" "*" "*"
Area_FarWest Area_GreatLakes Area_Southwest Area_Plains
1 ( 1 ) " " " " " " " "
2 ( 1 ) " " " " " " " "
3 ( 1 ) " " " " " " " "
4 ( 1 ) " " " " " " " "
5 ( 1 ) " " " " "*" " "
6 ( 1 ) " " " " "*" " "
7 ( 1 ) " " " " "*" " "
8 ( 1 ) " " " " "*" " "
9 ( 1 ) " " "*" "*" " "
10 ( 1 ) " " "*" "*" " "
11 ( 1 ) " " "*" "*" " "
12 ( 1 ) " " "*" "*" " "
13 ( 1 ) " " "*" "*" " "
14 ( 1 ) " " "*" "*" " "
15 ( 1 ) " " "*" "*" " "
16 ( 1 ) " " "*" "*" "*"
17 ( 1 ) "*" "*" "*" "*"
sub_model1<-lm(f1_1, data = avo_train)
yhat_train_stepwise1 <- predict(sub_model1, avo_train)
MSE_train_stepwise1 <- mean((avo_train$AveragePrice - yhat_train_stepwise1)^2)
MSE_train_stepwise1
[1] 0.06454066
yhat_test_stepwise1 <- predict(sub_model1, avo_test)
MSE_test_stepwise1 <- mean((avo_test$AveragePrice - yhat_test_stepwise1)^2)
MSE_test_stepwise1
[1] 387.6257
However, the test MSE is still far too large, so we decided to remove predictors one at a time based on the summary. We tried removing “Area_Plains”, “Area_FarWest”, and “LargeBags” in turn, and found that removing “LargeBags” dramatically reduces the test MSE, to 0.1397455, the lowest of all the models we tried. So the best-fitting model is f1_2.
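The manual drop-one search described above can be sketched as a short loop (assuming f1_1 from the previous chunk; candidates is simply the set of predictors we chose to try):

```r
# Drop each candidate predictor in turn and compare held-out MSE
candidates <- c("Area_Plains", "Area_FarWest", "LargeBags")
for (drop in candidates) {
  fit_tmp  <- lm(update(f1_1, as.formula(paste0(". ~ . - ", drop))), data = avo_train)
  yhat_tmp <- predict(fit_tmp, avo_test)
  cat(drop, "test MSE:", mean((avo_test$AveragePrice - yhat_tmp)^2), "\n")
}
```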
f1_2 <- as.formula(AveragePrice ~ month + type_conventional + TotalVolume +
  PLU4046 + PLU4770 + PLU4225 + SmallBags + XLargeBags +
  Area_NewEngland + Area_Mideast + Area_RockyMountain +
  Area_FarWest + Area_GreatLakes + Area_Southwest + Area_Plains)
regfit.bwd2 = regsubsets(f1_2, data = avo_train, nvmax = 19, method = "backward")
summary(regfit.bwd2)
Subset selection object
Call: regsubsets.formula(f1_2, data = avo_train, nvmax = 19, method = "backward")
15 Variables (and intercept)
Forced in Forced out
month FALSE FALSE
type_conventional FALSE FALSE
TotalVolume FALSE FALSE
PLU4046 FALSE FALSE
PLU4770 FALSE FALSE
PLU4225 FALSE FALSE
SmallBags FALSE FALSE
XLargeBags FALSE FALSE
Area_NewEngland FALSE FALSE
Area_Mideast FALSE FALSE
Area_RockyMountain FALSE FALSE
Area_FarWest FALSE FALSE
Area_GreatLakes FALSE FALSE
Area_Southwest FALSE FALSE
Area_Plains FALSE FALSE
1 subsets of each size up to 15
Selection Algorithm: backward
month type_conventional TotalVolume PLU4046 PLU4770
1 ( 1 ) " " "*" " " " " " "
2 ( 1 ) " " "*" " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) "*" "*" " " " " " "
5 ( 1 ) "*" "*" " " " " " "
6 ( 1 ) "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " " " " "
8 ( 1 ) "*" "*" "*" " " " "
9 ( 1 ) "*" "*" "*" " " " "
10 ( 1 ) "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" " " "*"
13 ( 1 ) "*" "*" "*" "*" "*"
14 ( 1 ) "*" "*" "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*" "*"
PLU4225 SmallBags XLargeBags Area_NewEngland
1 ( 1 ) " " " " " " " "
2 ( 1 ) " " " " " " " "
3 ( 1 ) " " " " " " "*"
4 ( 1 ) " " " " " " "*"
5 ( 1 ) " " " " " " "*"
6 ( 1 ) " " " " " " "*"
7 ( 1 ) " " " " " " "*"
8 ( 1 ) " " " " " " "*"
9 ( 1 ) "*" " " " " "*"
10 ( 1 ) "*" " " " " "*"
11 ( 1 ) "*" "*" " " "*"
12 ( 1 ) "*" "*" " " "*"
13 ( 1 ) "*" "*" " " "*"
14 ( 1 ) "*" "*" "*" "*"
15 ( 1 ) "*" "*" "*" "*"
Area_Mideast Area_RockyMountain Area_FarWest
1 ( 1 ) " " " " " "
2 ( 1 ) "*" " " " "
3 ( 1 ) "*" " " " "
4 ( 1 ) "*" " " " "
5 ( 1 ) "*" " " " "
6 ( 1 ) "*" "*" " "
7 ( 1 ) "*" "*" "*"
8 ( 1 ) "*" "*" "*"
9 ( 1 ) "*" "*" "*"
10 ( 1 ) "*" "*" "*"
11 ( 1 ) "*" "*" "*"
12 ( 1 ) "*" "*" "*"
13 ( 1 ) "*" "*" "*"
14 ( 1 ) "*" "*" "*"
15 ( 1 ) "*" "*" "*"
Area_GreatLakes Area_Southwest Area_Plains
1 ( 1 ) " " " " " "
2 ( 1 ) " " " " " "
3 ( 1 ) " " " " " "
4 ( 1 ) " " " " " "
5 ( 1 ) " " "*" " "
6 ( 1 ) " " "*" " "
7 ( 1 ) " " "*" " "
8 ( 1 ) " " "*" " "
9 ( 1 ) " " "*" " "
10 ( 1 ) " " "*" "*"
11 ( 1 ) " " "*" "*"
12 ( 1 ) " " "*" "*"
13 ( 1 ) " " "*" "*"
14 ( 1 ) " " "*" "*"
15 ( 1 ) "*" "*" "*"
sub_model2<-lm(f1_2, data = avo_train)
yhat_train_stepwise2 <- predict(sub_model2, avo_train)
MSE_train_stepwise2 <- mean((avo_train$AveragePrice - yhat_train_stepwise2)^2)
MSE_train_stepwise2
[1] 0.06473134
yhat_test_stepwise2 <- predict(sub_model2, avo_test)
MSE_test_stepwise2 <- mean((avo_test$AveragePrice - yhat_test_stepwise2)^2)
MSE_test_stepwise2
[1] 0.1397455
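Since AveragePrice is in dollars, the test MSE is easier to interpret as an RMSE (a quick conversion of the value reported above):

```r
# Root mean squared error, in dollars per avocado
sqrt(MSE_test_stepwise2)  # roughly 0.37
```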
Calculate the predicted price for the full avo dataset (train and test combined) and merge it into a single data frame.
yhat_avo_avgprice <- predict(sub_model2, avo)
df1_bwd<-avo %>%
select(Date, AveragePrice)
df2_bwd<-cbind(df1_bwd, yhat_avo_avgprice)
colnames(df2_bwd)
[1] "Date" "AveragePrice" "yhat_avo_avgprice"
names(df2_bwd)[3]<-"AveragePrice_hat"
Plot the actual average price against the predicted average price. The backward-selection prediction is very similar to the ridge and lasso predictions, as the following graphs will show.
plot1_bwd <- df2_bwd %>%
group_by(Date) %>%
summarize(
MeanAvg=mean(AveragePrice),
MeanAvg_hat=mean(AveragePrice_hat))%>%
ggplot()+
geom_line(aes(Date, MeanAvg),color = "#356211")+
geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
theme_classic()
plot1_bwd + geom_vline(aes(xintercept = as.numeric(Date[113])),
linetype = "dashed", size = 1,
color = "#AA471F")
Run ridge regression
fit_ridge <- cv.glmnet(x1_train, y1_train, alpha = 0, nfolds = 100)
fit_ridge$lambda
[1] 261.99201530 238.71736676 217.51037385 198.18735174
[5] 180.58093365 164.53862122 149.92146362 136.60285400
[9] 124.46743295 113.41008927 103.33504952 94.15504853
[13] 85.79057353 78.16917543 71.22484134 64.89742274
[17] 59.13211457 53.87898050 49.09252037 44.73127617
[21] 40.75747288 37.13669133 33.83757002 30.83153357
[25] 28.09254510 25.59688082 23.32292448 21.25098015
[29] 19.36310165 17.64293707 16.07558716 14.64747631
[33] 13.34623488 12.16059216 11.08027867 10.09593725
[37] 9.19904201 8.38182448 7.63720629 6.95873793
[41] 6.34054284 5.77726649 5.26403005 4.79638813
[45] 4.37029023 3.98204569 3.62829173 3.30596429
[49] 3.01227154 2.74466964 2.50084075 2.27867295
[53] 2.07624193 1.89179432 1.72373252 1.57060088
[57] 1.43107302 1.30394044 1.18810197 1.08255426
[61] 0.98638312 0.89875557 0.81891260 0.74616266
[65] 0.67987563 0.61947735 0.56444469 0.51430098
[69] 0.46861190 0.42698171 0.38904983 0.35448772
[73] 0.32299600 0.29430193 0.26815696 0.24433463
[77] 0.22262862 0.20285090 0.18483019 0.16841038
[81] 0.15344927 0.13981726 0.12739628 0.11607875
[85] 0.10576663 0.09637061 0.08780931 0.08000858
[89] 0.07290084 0.06642453 0.06052355 0.05514681
[93] 0.05024772 0.04578385 0.04171654 0.03801056
[97] 0.03463380 0.03155703 0.02875359 0.02619920
Predict response
yhat_train_ridge <- predict(fit_ridge, x1_train, s = fit_ridge$lambda)
yhat_test_ridge <- predict(fit_ridge, x1_test, s = fit_ridge$lambda)
mse_train_ridge <- vector()
mse_test_ridge <- vector()
# Compute train and test MSE at each value of lambda
for (i in 1:length(fit_ridge$lambda)) {
  mse_train_ridge[i] <- mean((y1_train - yhat_train_ridge)[, i]^2)
  mse_test_ridge[i] <- mean((y1_test - yhat_test_ridge)[, i]^2)
}
mse_train_ridge
[1] 0.15012060 0.14949727 0.14943704 0.14937104 0.14929873
[6] 0.14921953 0.14913279 0.14903783 0.14893387 0.14882010
[11] 0.14869563 0.14855948 0.14841061 0.14824789 0.14807008
[16] 0.14787589 0.14766389 0.14743256 0.14718028 0.14690530
[21] 0.14660576 0.14627969 0.14592501 0.14553950 0.14512083
[26] 0.14466658 0.14417419 0.14364104 0.14306439 0.14244146
[31] 0.14176939 0.14104530 0.14026632 0.13942957 0.13853228
[36] 0.13757171 0.13654533 0.13545073 0.13428578 0.13304862
[41] 0.13173774 0.13035204 0.12889089 0.12735415 0.12574233
[46] 0.12405657 0.12229876 0.12047153 0.11857835 0.11662358
[51] 0.11461245 0.11255113 0.11044666 0.10830698 0.10614085
[56] 0.10395775 0.10176778 0.09958151 0.09740983 0.09526373
[61] 0.09315410 0.09109152 0.08908605 0.08714747 0.08528368
[66] 0.08350210 0.08180885 0.08020874 0.07870524 0.07730049
[71] 0.07599533 0.07478934 0.07368098 0.07266770 0.07174781
[76] 0.07091361 0.07016229 0.06948863 0.06888726 0.06835268
[81] 0.06787937 0.06746188 0.06709495 0.06677351 0.06649281
[86] 0.06624837 0.06603608 0.06585213 0.06569309 0.06555584
[91] 0.06543771 0.06533589 0.06524842 0.06517338 0.06510903
[96] 0.06505385 0.06500653 0.06496546 0.06493110 0.06490090
mse_test_ridge
[1] 0.1963519 0.1958176 0.1957660 0.1957095 0.1956476 0.1955798
[7] 0.1955056 0.1954244 0.1953355 0.1952383 0.1951320 0.1950157
[13] 0.1948887 0.1947499 0.1945983 0.1944329 0.1942524 0.1940556
[19] 0.1938411 0.1936076 0.1933534 0.1930770 0.1927767 0.1924507
[25] 0.1920972 0.1917141 0.1912995 0.1908514 0.1903675 0.1898458
[31] 0.1892842 0.1886804 0.1880324 0.1873380 0.1865955 0.1858028
[37] 0.1849584 0.1840607 0.1831086 0.1821010 0.1810374 0.1799175
[43] 0.1787414 0.1775098 0.1762238 0.1748851 0.1734960 0.1720594
[49] 0.1705790 0.1690590 0.1675044 0.1659210 0.1643153 0.1626943
[55] 0.1610657 0.1594378 0.1578192 0.1562189 0.1546458 0.1531091
[61] 0.1516174 0.1501791 0.1488020 0.1474937 0.1462592 0.1451038
[67] 0.1440313 0.1430443 0.1421442 0.1413313 0.1406046 0.1399621
[73] 0.1394009 0.1389173 0.1385057 0.1381629 0.1378829 0.1376603
[79] 0.1374894 0.1373648 0.1372811 0.1372329 0.1372156 0.1372244
[85] 0.1372551 0.1373040 0.1373676 0.1374427 0.1375268 0.1376173
[91] 0.1377152 0.1378158 0.1379155 0.1380143 0.1381120 0.1382081
[97] 0.1383017 0.1383898 0.1384754 0.1385589
min(mse_train_ridge)
[1] 0.0649009
min(mse_test_ridge)
[1] 0.1372156
lambda_min_mse_train_ridge<- fit_ridge$lambda[which.min(mse_train_ridge)]
lambda_min_mse_test_ridge <-fit_ridge$lambda[which.min(mse_test_ridge)]
lambda_min_mse_train_ridge
[1] 0.0261992
lambda_min_mse_test_ridge
[1] 0.1273963
yhat_train_ridge <- predict(fit_ridge, x1_train, s = lambda_min_mse_train_ridge)
yhat_test_ridge <- predict(fit_ridge, x1_test, s = lambda_min_mse_test_ridge)
yhat_1<- predict(fit_ridge, x1_avo, s = lambda_min_mse_test_ridge)
Aggregate data into one dataframe
p1<-avo %>%
select(Date, AveragePrice)
p2<-cbind(p1, yhat_1)
Plot
p2 %>%
group_by(Date)%>%
summarise(meanpriced = mean(AveragePrice))
names(p2)[3] <- "pre"
p2 %>%
group_by(Date) %>%
summarise(meanpriced = mean(AveragePrice),meanpre = mean(pre))%>%
ggplot()+
geom_line(mapping = aes(x=Date,
y=meanpriced), col = "#356211")+
geom_line(mapping = aes(x=Date, y= meanpre), col = "#cda989")+
geom_vline(xintercept=as.numeric(as.Date("2017-03-01")), col = "#AA471F", linetype = "dashed") +
labs(title = "Ridge Regression",
y = "Mean Price")
The ridge model setup is largely the same as the lasso model's, so after building a lasso model we will compare the two to see which gives the more accurate prediction for our dataset.
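The practical difference between the two penalties shows up in the coefficients they produce: ridge (L2) shrinks every coefficient toward zero but rarely to exactly zero, while lasso (L1) can zero irrelevant coefficients out entirely. A minimal illustrative sketch in Python with scikit-learn (not part of our R analysis; the synthetic data and all names here are ours):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Only the first two features actually drive the response.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks but keeps every coefficient nonzero;
# lasso can drive the irrelevant coefficients to exactly zero.
print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```

This is why the lasso coefficient tables later in this notebook show `.` entries (exact zeros) while ridge does not.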
Run lasso model
lmodel <- glmnet(x1_train, y1_train, alpha = 1, nlambda = 100)
lmodel$lambda
[1] 0.2619920153 0.2387173668 0.2175103739 0.1981873517
[5] 0.1805809337 0.1645386212 0.1499214636 0.1366028540
[9] 0.1244674330 0.1134100893 0.1033350495 0.0941550485
[13] 0.0857905735 0.0781691754 0.0712248413 0.0648974227
[17] 0.0591321146 0.0538789805 0.0490925204 0.0447312762
[21] 0.0407574729 0.0371366913 0.0338375700 0.0308315336
[25] 0.0280925451 0.0255968808 0.0233229245 0.0212509802
[29] 0.0193631016 0.0176429371 0.0160755872 0.0146474763
[33] 0.0133462349 0.0121605922 0.0110802787 0.0100959373
[37] 0.0091990420 0.0083818245 0.0076372063 0.0069587379
[41] 0.0063405428 0.0057772665 0.0052640301 0.0047963881
[45] 0.0043702902 0.0039820457 0.0036282917 0.0033059643
[49] 0.0030122715 0.0027446696 0.0025008407 0.0022786730
[53] 0.0020762419 0.0018917943 0.0017237325 0.0015706009
[57] 0.0014310730 0.0013039404 0.0011881020 0.0010825543
[61] 0.0009863831 0.0008987556 0.0008189126 0.0007461627
[65] 0.0006798756 0.0006194773 0.0005644447 0.0005143010
[69] 0.0004686119 0.0004269817 0.0003890498 0.0003544877
[73] 0.0003229960 0.0002943019 0.0002681570 0.0002443346
Predict the response; the minimum MSE is 0.06472616 for the train set and 0.1390842 for the test set.
y1_train_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_train)
y1_test_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_test)
mse_train = vector()
mse_test = vector()
for (i in 1:length(lmodel$lambda)) {
mse_train[i] <- mean((y1_train - y1_train_hat)[,i]^2)
mse_test[i] <- mean((y1_test - y1_test_hat)[,i]^2)
}
min(mse_train)
[1] 0.06472616
min(mse_test)
[1] 0.1390842
Check the MSE-minimizing lambda for the train and test datasets
lambda_min_mse_train_lasso<- lmodel$lambda[which.min(mse_train)]
lambda_min_mse_test_lasso <-lmodel$lambda[which.min(mse_test)]
lambda_min_mse_train_lasso
[1] 0.0002443346
lambda_min_mse_test_lasso
[1] 0.002500841
Using the cross-validation function to find the best lambda; the result shows the best lambda for the training dataset is the same as above.
set.seed(1)
cv.out = cv.glmnet(x1_train, y1_train, alpha = 1)
## plot the lambda
plot(cv.out)
##check the best lambda
bestlam = cv.out$lambda.min
bestlam ## the best lambda for the training dataset is the same as lambda_min_mse_train
[1] 0.0002443346
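cv.glmnet splits the training data into folds, fits the model over the whole lambda grid, and keeps the lambda with the lowest mean validation MSE. The same idea sketched in Python with scikit-learn's LassoCV on synthetic data (illustrative only; all names are ours):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# LassoCV builds a 100-value penalty grid (like glmnet's nlambda = 100),
# runs 10-fold cross-validation, and keeps the penalty with the lowest
# mean validation MSE -- the analogue of cv.glmnet's lambda.min.
cv_lasso = LassoCV(n_alphas=100, cv=10).fit(X, y)
print("lambda grid size:", len(cv_lasso.alphas_))
print("best lambda:", cv_lasso.alpha_)
```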
Check the coefficients for the f1 model; the result shows that some predictors can be eliminated: "TotalVolume", "PLU4225", "SmallBags", and "Area_TotalUS".
f1coef<-coef(lmodel, s = lambda_min_mse_test_lasso)
f1coef
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.494037e+00
month 1.455341e-02
type_conventional -5.117866e-01
type_organic 1.250828e-11
TotalVolume .
PLU4046 -8.506223e-09
PLU4770 3.264729e-09
PLU4225 .
SmallBags .
LargeBags -4.707824e-08
XLargeBags 2.318309e-07
Area_NewEngland 2.294789e-01
Area_Southeast -2.893281e-02
Area_Mideast 1.977691e-01
Area_RockyMountain -1.238285e-01
Area_FarWest 1.884770e-02
Area_GreatLakes -2.182073e-02
Area_Southwest -1.614236e-01
Area_Plains 2.748295e-02
Area_TotalUS .
Create a new formula with the remaining predictors (eliminating the uncorrelated ones)
f2 <- as.formula(AveragePrice ~ month + type_conventional + type_organic +
PLU4046 + PLU4770 + LargeBags + XLargeBags +
Area_NewEngland + Area_Southeast + Area_Mideast +
Area_RockyMountain + Area_GreatLakes +
Area_Southwest + Area_Plains)
x2_train <- model.matrix(f2,avo_train)[,-1]
x2_test <- model.matrix(f2, avo_test)[,-1]
Run lasso model again with new predictors
lmodel2 <- glmnet(x2_train, y1_train, alpha = 1, nlambda = 100)
Predict response with new predictors
y2_train_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_train)
y2_test_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_test)
Compute the MSE again with the new predictors. After eliminating the uncorrelated predictors, the train MSE (0.06482449) increases slightly while the test MSE (0.1390709) decreases slightly.
mse_train1 = vector()
mse_test1 = vector()
for (i in 1:length(lmodel2$lambda)) {
mse_train1[i] <- mean((y1_train - y2_train_hat)[,i]^2)
mse_test1[i] <- mean((y1_test - y2_test_hat)[,i]^2)
}
min(mse_train1)
[1] 0.06482449
min(mse_test1)
[1] 0.1390709
Check the new MSE-minimizing lambda for the train and test datasets
lambda_min_mse_train1<- lmodel2$lambda[which.min(mse_train1)]
lambda_min_mse_test1 <-lmodel2$lambda[which.min(mse_test1)]
lambda_min_mse_train1
[1] 0.0003890498
lambda_min_mse_test1
[1] 0.002500841
Check the coefficients for f2; there are no more uncorrelated predictors in our model.
f2coef<-coef(lmodel2, s = lambda_min_mse_test1)
f2coef
15 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.512389e+00
month 1.454115e-02
type_conventional -5.109947e-01
type_organic 1.145366e-11
PLU4046 -9.401350e-09
PLU4770 1.431715e-09
LargeBags -4.939522e-08
XLargeBags 2.314856e-07
Area_NewEngland 2.107071e-01
Area_Southeast -4.740335e-02
Area_Mideast 1.790446e-01
Area_RockyMountain -1.423959e-01
Area_GreatLakes -4.033916e-02
Area_Southwest -1.797424e-01
Area_Plains 9.054443e-03
Since the second model has the lowest test MSE, we decided to use "lmodel2" for the final prediction.
x2_avo <- model.matrix(f2, avo)[,-1]
y2_avo_min_lambda_hat <- predict(lmodel2, s = lambda_min_mse_test1, newx = x2_avo)
class(y2_avo_min_lambda_hat)
[1] "matrix"
Aggregate data into one dataframe for the first model prediction
df1<-avo %>%
select(Date, AveragePrice)
df2<-cbind(df1, y2_avo_min_lambda_hat)
names(df2)[3]<-"AveragePrice_hat"
Plot the actual average price and the predictive average price
head(df2)
Date AveragePrice AveragePrice_hat
1 2015-01-04 1.22 1.2265901
2 2015-01-04 1.00 0.9636202
3 2015-01-04 1.08 1.1943327
4 2015-01-04 1.01 0.8730412
5 2015-01-04 1.02 1.2265626
6 2015-01-04 1.40 1.1949496
class(df2$Date)
[1] "Date"
plot1 <- df2 %>%
group_by(Date) %>%
summarize(
MeanAvg=mean(AveragePrice),
MeanAvg_hat=mean(AveragePrice_hat)) %>%
ggplot()+
geom_line(aes(Date, MeanAvg),color = "#356211")+
geom_line(aes(Date, MeanAvg_hat), color = "#cda989")+
theme_classic()
plot1 + geom_vline(aes(xintercept = as.numeric(Date[113])),
linetype = "dashed", size = 1,
color = "#AA471F")
We can see that our ridge model actually did a better job than lasso (test MSE 0.1372156 vs 0.1390709). Ridge shrinks coefficients without forcing any of them to zero, so if every variable carries some signal in our dataset, ridge can outperform lasso.
We use the rpart function to build decision trees.
fit.tree <- rpart(f1,
avo_train,
control = rpart.control(cp = 0.001))
par(xpd = TRUE)
plot(fit.tree, compress=TRUE)
text(fit.tree, use.n=TRUE)
rpart.plot(fit.tree, type = 1)
yhat.train.tree <- predict(fit.tree, avo_train)
mse.train.tree <- mean((avo_train$AveragePrice - yhat.train.tree)^2)
mse.train.tree
[1] 0.03348875
yhat.test.tree <- predict(fit.tree, avo_test)
mse.test.tree <- mean((avo_test$AveragePrice - yhat.test.tree)^2)
mse.test.tree
[1] 0.1470199
The tree plot is difficult to read, even after we adjust the plot type. The test MSE is 0.1470199, which is greater than lasso's. A single decision tree makes little sense for predicting prices; we trained this one mainly for practice.
We use the tree function to create regression trees.
tree.avo =tree(f1, avo_train)
summary(tree.avo)
Regression tree:
tree(formula = f1, data = avo_train)
Variables actually used in tree construction:
[1] "type_organic" "PLU4046" "Area_Mideast" "LargeBags"
[5] "PLU4225" "month"
Number of terminal nodes: 9
Residual mean deviance: 0.05779 = 704.7 / 12190
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.93200 -0.15470 -0.01157 0.00000 0.13910 1.25200
cv.avo =cv.tree(tree.avo)
prune.avo =prune.tree(tree.avo,best =5)
plot(prune.avo)
text(prune.avo,pretty =0)
yhat <- predict(tree.avo, newdata = avo_test)
mse_regreTree_test <- mean((yhat - avo_test$AveragePrice)^2)
mse_regreTree_test
[1] 0.1499934
The terminal nodes show the price per avocado in each branch, and the test MSE is 0.1499934, even greater than the rpart decision tree's.
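prune.tree trims the fitted tree back to a chosen number of terminal nodes. The analogous idea in scikit-learn is cost-complexity pruning via ccp_alpha; a small illustrative sketch on synthetic data (all names are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 3))
# One real step in the signal plus noise: an unpruned tree chases the noise.
y = np.where(X[:, 0] > 0.5, 2.0, 1.0) + rng.normal(scale=0.1, size=500)

full = DecisionTreeRegressor(random_state=0).fit(X, y)
# ccp_alpha sets the cost-complexity penalty: larger values prune harder,
# playing a role similar to `best =` in prune.tree.
pruned = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X, y)

print("leaves, unpruned:", full.get_n_leaves())
print("leaves, pruned:  ", pruned.get_n_leaves())
```

Pruning sacrifices a little training fit for a far simpler, more interpretable tree.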
We use the randomForest function to build bagged trees by setting mtry to the number of predictors in our base formula f1.
set.seed (1)
bag.avo =randomForest(f1, data = avo_train,
mtry = 19, importance =TRUE)
bag.avo
Call:
randomForest(formula = f1, data = avo_train, mtry = 19, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 19
Mean of squared residuals: 0.01439899
% Var explained: 90.41
Then we create a scatter plot of the test predictions against the actual test prices, and add a 45-degree reference line.
yhat.bag = predict (bag.avo, newdata = avo_test)
plot(yhat.bag, avo_test$AveragePrice, main = "Scatter Plot for Bagging",
xlab = "Prediction price in test set", ylab = "Average price in test set",
col = "#bdcc64")
abline (0,1)
mse_bag_test <- mean((yhat.bag - avo_test$AveragePrice)^2)
mse_bag_test
[1] 0.1350875
As we can see from this graph, the points cluster above the trend line; ideally, they would lie on top of it. The test MSE is 0.1350875, which we treat as the baseline for the other tree models.
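Bagging is simply a random forest with mtry equal to the number of predictors: every tree considers every variable at each split, and averaging many bootstrapped trees reduces variance. An illustrative Python sketch with scikit-learn (synthetic data; all names are ours), where max_features=None reproduces the mtry = p setting:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 6))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# max_features=None considers every predictor at each split,
# i.e. mtry = p: pure bagging rather than a decorrelated forest.
bag = RandomForestRegressor(n_estimators=200, max_features=None,
                            random_state=0).fit(X_tr, y_tr)

print("single tree test MSE:", mean_squared_error(y_te, tree.predict(X_te)))
print("bagging test MSE:    ", mean_squared_error(y_te, bag.predict(X_te)))
```

The variance reduction from averaging is why the bagged ensemble beats a single deep tree out of sample.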
Building the model first:
fit_rf <- randomForest(f1,
avo_train,
ntree=300,
do.trace=F)
varImpPlot(fit_rf)
Predicting our y_hats for train and for test data:
yhat_rf_train <- predict(fit_rf, avo_train)
yhat_rf_test <- predict(fit_rf, avo_test)
Calculating MSE for both train and test:
mse_rf_train <- mean((yhat_rf_train - y1_train) ^ 2)
mse_rf_test <- mean((yhat_rf_test - y1_test)^2)
print(mse_rf_train)
[1] 0.003639148
print(mse_rf_test)
[1] 0.1279945
Plot actual and predicted prices together. First, prepare the data by creating a dataframe with Date, AveragePrice, and the predicted price (y_hat) for both train and test:
avo %>%
select(Date, AveragePrice) -> plot_rf_full
yhat_rf_full <- c(yhat_rf_train,yhat_rf_test )
cbind(plot_rf_full, yhat_rf_full) -> plot_rf_full1
plot_rf_full1 %>%
group_by(Date) %>%
summarise(mean_y = mean(AveragePrice),
mean_yhat = mean(yhat_rf_full)) %>%
ggplot() +
geom_line(aes(x = Date, y = mean_y), col = "#356211") +
geom_line(aes(x = Date, y = mean_yhat), col = "#cda989") -> rf_graph
rf_graph + geom_vline(aes(xintercept = as.numeric(Date[113])),
linetype = "dashed", size = 1,
color = "#AA471F") +
labs(title = "Random Forest",
y = "Mean Price")
As we can see in the graph above, with 300 trees the random forest fits the training data very well but does noticeably worse on the test data. Nevertheless, the fit looks better than that of the less flexible models we have used.
We could have tried increasing the number of trees to see whether it gives better estimates, but unfortunately the RStudio instances we used did not have enough capacity to run random forests with more than 300 trees.
We use the gbm function to run boosted trees. We set the number of trees to 150, half the number used in the random forest, with interaction.depth set to 4 and shrinkage set to 0.1. We also pass the cv.folds argument to apply k-fold cross-validation during training, splitting the training set into 10 folds.
boostingcv <- gbm(f1,
data = avo_train,
distribution = "gaussian",
n.trees = 150,
interaction.depth = 4,
cv.folds = 10,
shrinkage = 0.1)
Warning: NAs introduced by coercion (message repeated 10 times)
relative.influence(boostingcv)
n.trees not given. Using 150 trees.
month type_conventional type_organic
179.480280 1109.504577 953.431615
TotalVolume PLU4046 PLU4770
97.033943 398.272468 92.155986
PLU4225 SmallBags LargeBags
199.758316 98.467418 518.245860
XLargeBags Area_NewEngland Area_Southeast
9.652739 85.927019 26.551777
Area_Mideast Area_RockyMountain Area_FarWest
99.033165 16.475180 49.161052
Area_GreatLakes Area_Southwest Area_Plains
15.895736 60.478064 5.685901
Area_TotalUS
1.326137
yhat_btree <- predict(boostingcv, avo_train, n.trees = 150)
mse_btree <- mean((yhat_btree - y1_train) ^ 2)
print(mse_btree)
[1] 0.02992749
Here we get the train MSE for boosting. Then we calculate the test MSE.
yhat_btree_test <- predict(boostingcv, avo_test, n.trees = 150)
mse_btree_test <- mean((yhat_btree_test - y1_test) ^ 2)
print(mse_btree_test)
[1] 0.1293277
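For readers outside R, the gbm call above maps roughly onto scikit-learn's GradientBoostingRegressor: n.trees corresponds to n_estimators, interaction.depth to max_depth, and shrinkage to learning_rate. An illustrative sketch on synthetic data (not our avocado data; all names are ours):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Mirrors the gbm settings: 150 trees, depth 4, shrinkage (learning rate) 0.1.
boost = GradientBoostingRegressor(n_estimators=150, max_depth=4,
                                  learning_rate=0.1,
                                  random_state=0).fit(X_tr, y_tr)
print("boosting test MSE:", mean_squared_error(y_te, boost.predict(X_te)))
```

Each tree is fit to the residuals of the ensemble so far, scaled down by the learning rate, so many shallow trees combine into a strong model.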
Next, we add the boosting predictions into train and test sets to plot the trend for predictions and actuals.
avo_train$prediction_btree <- yhat_btree
avo_test$prediction_btree <- yhat_btree_test
avo_plot <- rbind(avo_train, avo_test)
btree_plot <- avo_plot %>%
group_by(Date) %>%
summarise(meanAvg = mean(AveragePrice),
meanAvg_hat = mean(prediction_btree)) %>%
ggplot() +
geom_line(aes(Date, meanAvg), col = "#356211") +
geom_line(aes(Date, meanAvg_hat), col = "#cda989") +
labs(title = "Boosting",
y = "Mean Price") +
theme_clean()
btree_plot +
geom_vline(aes(xintercept = as.numeric(Date[113])),
linetype = "dashed", size = 1, col = "#AA471F")
The boosting model illustrates the seasonality well, though its test MSE is slightly higher than the random forest's. At the end of this step, we prefer the random forest as our model of choice.
ds1 <- read_csv("avocado3.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  Date = col_date(format = ""),
  AveragePrice = col_double(),
  `Total Volume` = col_double(),
  `4046` = col_double(),
  `4225` = col_double(),
  `4770` = col_double(),
  `Total Bags` = col_double(),
  `Small Bags` = col_double(),
  `Large Bags` = col_double(),
  `XLarge Bags` = col_double(),
  type = col_character(),
  year = col_double(),
  region = col_character(),
  NewPrice = col_double(),
  NewPrice2 = col_double()
)
glimpse(ds1)
Observations: 18,249
Variables: 16
$ X1             <dbl> 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 5…
$ Date           <date> 2015-01-04, 2015-01-04, 2015-01-04, 2015…
$ AveragePrice   <dbl> 1.22, 1.00, 1.08, 1.01, 1.02, 1.40, 0.93,…
$ `Total Volume` <dbl> 40873.28, 435021.49, 788025.06, 80034.32,…
$ `4046`         <dbl> 2819.50, 364302.39, 53987.31, 44562.12, 7…
$ `4225`         <dbl> 28287.42, 23821.16, 552906.04, 24964.23, …
$ `4770`         <dbl> 49.90, 82.15, 39995.03, 2752.35, 128.82, …
$ `Total Bags`   <dbl> 9716.46, 46815.79, 141136.68, 7755.62, 87…
$ `Small Bags`   <dbl> 9186.93, 16707.15, 137146.07, 6064.30, 87…
$ `Large Bags`   <dbl> 529.53, 30108.64, 3990.61, 1691.32, 256.2…
$ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 3375.…
$ type           <chr> "conventional", "conventional", "conventi…
$ year           <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015,…
$ region         <chr> "Albany", "Atlanta", "BaltimoreWashington…
$ NewPrice       <dbl> 0.00, 1.22, 1.00, 1.08, 1.01, 1.02, 1.40,…
$ NewPrice2      <dbl> 0.00, 0.00, 1.22, 1.00, 1.08, 1.01, 1.02,…
The new dataset, called avocado3, is one we created from the original avocado dataset. We added two columns, "NewPrice" and "NewPrice2", based on the existing "AveragePrice": NewPrice is the AveragePrice from one row above, and NewPrice2 is the AveragePrice from two rows above. We will add these two lagged columns as new variables to make a more accurate prediction for our time-series-based dataset.
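The two lagged columns can be reproduced with a simple row shift. An illustrative sketch in Python with pandas (the toy prices mirror the head of the dataset; zero-filling the leading rows matches how NewPrice and NewPrice2 start at 0 in avocado3.csv):

```python
import pandas as pd

# Toy prices mirroring the head of the avocado data.
df = pd.DataFrame({"AveragePrice": [1.22, 1.00, 1.08, 1.01, 1.02]})
# Lag by one and by two rows; the leading gaps are filled with 0,
# matching how NewPrice and NewPrice2 begin in the new dataset.
df["NewPrice"] = df["AveragePrice"].shift(1, fill_value=0.0)
df["NewPrice2"] = df["AveragePrice"].shift(2, fill_value=0.0)
print(df)
```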
ds1 <- ds1 %>%
group_by(type, region) %>%
select(X1, year, Date, type, region, everything()) %>%
arrange(Date)
colName <- names(ds1)
colName[1] <- "ID"
names(ds1) <- colName
ds1$ID <- seq(nrow(ds1))
ds1$month <- month(ds1$Date)
ds1 <- ds1 %>%
select(ID, year, month, everything())
ds1New <- dummy_cols(ds1, select_columns = "type") %>%
select(ID, year, month, region, type_conventional, type_organic,
everything(), -type)
ds1New$other_PLU <- ds1New$`Total Volume` - ds1New$`4046` - ds1New$`4225` - ds1New$`4770`
ds1New <- ds1New %>%
select(1:3, Date, everything())
uniqueRegion <- unique(ds1New$region)
uniqueRegion <- as.data.frame(uniqueRegion)
uniqueRegion$Area <- NA
uniqueRegion$Area[1] <- "NewEngland"
uniqueRegion$Area[2] <- "Southeast"
uniqueRegion$Area[3] <- "Mideast"
uniqueRegion$Area[4] <- "RockyMountain"
uniqueRegion$Area[5] <- "NewEngland"
uniqueRegion$Area[6] <- "Mideast"
uniqueRegion$Area[7] <- "FarWest"
uniqueRegion$Area[8] <- "Southeast"
uniqueRegion$Area[9] <- "GreatLakes"
uniqueRegion$Area[10] <- "GreatLakes"
uniqueRegion$Area[11] <- "GreatLakes"
uniqueRegion$Area[12] <- "Southwest"
uniqueRegion$Area[13] <- "RockyMountain"
uniqueRegion$Area[14] <- "GreatLakes"
uniqueRegion$Area[15] <- "GreatLakes"
uniqueRegion$Area[16] <- "GreatLakes"
uniqueRegion$Area[17] <- "Mideast"
uniqueRegion$Area[18] <- "NewEngland"
uniqueRegion$Area[19] <- "Southeast"
uniqueRegion$Area[20] <- "GreatLakes"
uniqueRegion$Area[21] <- "Southeast"
uniqueRegion$Area[22] <- "FarWest"
uniqueRegion$Area[23] <- "FarWest"
uniqueRegion$Area[24] <- "Southeast"
uniqueRegion$Area[25] <- "Southeast"
uniqueRegion$Area[26] <- "Southeast"
uniqueRegion$Area[27] <- "Southeast"
uniqueRegion$Area[28] <- "Southeast"
uniqueRegion$Area[29] <- "Mideast"
uniqueRegion$Area[30] <- "NewEngland"
uniqueRegion$Area[31] <- "NewEngland"
uniqueRegion$Area[32] <- "Southeast"
uniqueRegion$Area[33] <- "Mideast"
uniqueRegion$Area[34] <- "Southwest"
uniqueRegion$Area[35] <- "Mideast"
uniqueRegion$Area[36] <- "Plains"
uniqueRegion$Area[37] <- "FarWest"
uniqueRegion$Area[38] <- "Southeast"
uniqueRegion$Area[39] <- "Southeast"
uniqueRegion$Area[40] <- "Southeast"
uniqueRegion$Area[41] <- "FarWest"
uniqueRegion$Area[42] <- "FarWest"
uniqueRegion$Area[43] <- "FarWest"
uniqueRegion$Area[44] <- "FarWest"
uniqueRegion$Area[45] <- "Southeast"
uniqueRegion$Area[46] <- "Southeast"
uniqueRegion$Area[47] <- "Southeast"
uniqueRegion$Area[48] <- "FarWest"
uniqueRegion$Area[49] <- "Plains"
uniqueRegion$Area[50] <- "Mideast"
uniqueRegion$Area[51] <- "Southeast"
uniqueRegion$Area[52] <- "TotalUS"
uniqueRegion$Area[53] <- "FarWest"
uniqueRegion$Area[54] <- "Southwest"
names(uniqueRegion)[1] <- "region"
avo <- ds1New %>%
left_join(uniqueRegion, by = "region") %>%
select(1:5, Area, everything())
avo <- dummy_cols(avo, select_columns = "Area")
names(avo)[10] <- "TotalVolume"
names(avo)[14] <- "TotalBags"
names(avo)[15] <- "SmallBags"
names(avo)[16] <- "LargeBags"
names(avo)[17] <- "XLargeBags"
names(avo)[11] <- "PLU4046"
names(avo)[12] <- "PLU4225"
names(avo)[13] <- "PLU4770"
set.seed(1234)
avo_train <- avo %>% filter(as.Date(Date) < "2017-03-01")
avo_train %>%
filter(year == 2017, month == 2)
avo_test <- avo %>% filter(as.Date(Date) >= "2017-03-01")
avo_test %>%
filter(year == 2018, month == 3)
Prepare the model with the new lagged variables
f2 <- as.formula(AveragePrice ~ month +type_conventional + type_organic + TotalVolume +
PLU4046 + PLU4770 + PLU4225 + SmallBags + LargeBags + XLargeBags +
+ Area_NewEngland + Area_Southeast
+ Area_Mideast + Area_RockyMountain
+ Area_FarWest
+ Area_GreatLakes + Area_Southwest
+ Area_Plains + Area_TotalUS + NewPrice + NewPrice2)
x1_train <- model.matrix(f2,avo_train)[,-1]
y1_train <- avo_train$AveragePrice
x1_test <- model.matrix(f2, avo_test)[,-1]
y1_test <- avo_test$AveragePrice
date_test <- avo_test$Date
x1_avo <- model.matrix(f2, avo)[,-1]
Run lasso
lmodel <- glmnet(x1_train, y1_train, alpha = 1, nlambda = 100)
lmodel$lambda
[1] 0.2619920153 0.2387173668 0.2175103739 0.1981873517
[5] 0.1805809337 0.1645386212 0.1499214636 0.1366028540
[9] 0.1244674330 0.1134100893 0.1033350495 0.0941550485
[13] 0.0857905735 0.0781691754 0.0712248413 0.0648974227
[17] 0.0591321146 0.0538789805 0.0490925204 0.0447312762
[21] 0.0407574729 0.0371366913 0.0338375700 0.0308315336
[25] 0.0280925451 0.0255968808 0.0233229245 0.0212509802
[29] 0.0193631016 0.0176429371 0.0160755872 0.0146474763
[33] 0.0133462349 0.0121605922 0.0110802787 0.0100959373
[37] 0.0091990420 0.0083818245 0.0076372063 0.0069587379
[41] 0.0063405428 0.0057772665 0.0052640301 0.0047963881
[45] 0.0043702902 0.0039820457 0.0036282917 0.0033059643
[49] 0.0030122715 0.0027446696 0.0025008407 0.0022786730
[53] 0.0020762419 0.0018917943 0.0017237325 0.0015706009
[57] 0.0014310730 0.0013039404 0.0011881020 0.0010825543
[61] 0.0009863831 0.0008987556 0.0008189126 0.0007461627
[65] 0.0006798756 0.0006194773 0.0005644447 0.0005143010
[69] 0.0004686119 0.0004269817 0.0003890498 0.0003544877
[73] 0.0003229960 0.0002943019 0.0002681570 0.0002443346
[77] 0.0002226286
Predict response
y1_train_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_train)
y1_test_hat <- predict(lmodel, s = lmodel$lambda, newx = x1_test)
length(y1_test_hat)
[1] 465542
mse_train = vector()
mse_test = vector()
for (i in 1:length(lmodel$lambda)) {
mse_train[i] <- mean((y1_train - y1_train_hat)[,i]^2)
mse_test[i] <- mean((y1_test - y1_test_hat)[,i]^2)
}
mse_train
[1] 0.15012060 0.13846677 0.12879155 0.12075901 0.11409026
[6] 0.10855374 0.10395723 0.10014113 0.09697293 0.09434263
[11] 0.09199804 0.08978030 0.08792382 0.08620959 0.08478002
[16] 0.08255566 0.08003344 0.07794176 0.07620517 0.07449818
[21] 0.07296746 0.07155953 0.07038881 0.06941684 0.06860990
[26] 0.06794111 0.06734291 0.06678767 0.06631274 0.06591720
[31] 0.06555875 0.06524122 0.06497820 0.06475930 0.06457757
[36] 0.06442670 0.06429494 0.06416733 0.06405669 0.06396450
[41] 0.06388794 0.06382482 0.06377207 0.06372820 0.06369136
[46] 0.06365996 0.06363365 0.06360427 0.06357663 0.06355334
[51] 0.06351938 0.06349116 0.06345885 0.06342691 0.06340217
[56] 0.06338119 0.06336399 0.06334973 0.06333763 0.06332777
[61] 0.06331962 0.06331264 0.06330724 0.06330286 0.06329890
[66] 0.06329546 0.06329288 0.06328888 0.06328599 0.06328380
[71] 0.06328181 0.06328004 0.06327853 0.06327746 0.06327648
[76] 0.06327561 0.06327503
mse_test
[1] 0.1963519 0.1866676 0.1787868 0.1723894 0.1672105 0.1630314
[7] 0.1596718 0.1569827 0.1548414 0.1531467 0.1511442 0.1483245
[13] 0.1459111 0.1431670 0.1408326 0.1385649 0.1364580 0.1347493
[19] 0.1333634 0.1320312 0.1308327 0.1296515 0.1286929 0.1279137
[25] 0.1272820 0.1267663 0.1262994 0.1259077 0.1256045 0.1253750
[31] 0.1251657 0.1249740 0.1248329 0.1247305 0.1246567 0.1246056
[37] 0.1245521 0.1244743 0.1243997 0.1243466 0.1243063 0.1242713
[43] 0.1242488 0.1242333 0.1242222 0.1242172 0.1242134 0.1241673
[49] 0.1240975 0.1240429 0.1240655 0.1240820 0.1241052 0.1240991
[55] 0.1240995 0.1241009 0.1241053 0.1241115 0.1241174 0.1241246
[61] 0.1241325 0.1241400 0.1241477 0.1241602 0.1241680 0.1241748
[67] 0.1241816 0.1241928 0.1242047 0.1242185 0.1242335 0.1242481
[73] 0.1242621 0.1242733 0.1242854 0.1242967 0.1243042
min(mse_test)
[1] 0.1240429
lambda_min_mse_train<- lmodel$lambda[which.min(mse_train)]
lambda_min_mse_test <-lmodel$lambda[which.min(mse_test)]
lambda_min_mse_train
[1] 0.0002226286
lambda_min_mse_test
[1] 0.00274467
Using the cross-validation function to find the best lambda
set.seed(1)
cv.out = cv.glmnet(x1_train, y1_train, alpha = 1)
Plot the lambda
plot(cv.out,
xlab = "Log(lambda)")
Check the best lambda
bestlam = cv.out$lambda.min
bestlam ## the best lambda for the training dataset is the same as lambda_min_mse_train
[1] 0.0002226286
Create a new formula with the remaining predictors (eliminating the uncorrelated ones)
f3 <- as.formula(AveragePrice ~ month + type_conventional + type_organic +
PLU4046 + PLU4770 + PLU4225 + LargeBags + XLargeBags +
Area_NewEngland + Area_Southeast + Area_Mideast +
Area_RockyMountain + Area_GreatLakes +
Area_Southwest + Area_Plains)
x2_train <- model.matrix(f3,avo_train)[,-1]
x2_test <- model.matrix(f3, avo_test)[,-1]
Run lasso model again with new predictors
lmodel2 <- glmnet(x2_train, y1_train, alpha = 1, nlambda = 100)
Predict response with new predictors
y2_train_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_train)
y2_test_hat <- predict(lmodel2, s = lmodel2$lambda, newx = x2_test)
Compute the MSE again with the new predictors. The results show the training MSE increases when the uncorrelated predictors are eliminated, while the test MSE stays about the same.
mse_train1 = vector()
mse_test1 = vector()
for (i in 1:length(lmodel2$lambda)) {
mse_train1[i] <- mean((y1_train - y2_train_hat)[,i]^2)
mse_test1[i] <- mean((y1_test - y2_test_hat)[,i]^2)
}
mse_train1
[1] 0.15012060 0.13846677 0.12879155 0.12075901 0.11409026
[6] 0.10855374 0.10395723 0.10014113 0.09697293 0.09434263
[11] 0.09215892 0.09034596 0.08884081 0.08759121 0.08655376
[16] 0.08431018 0.08196395 0.07990953 0.07779947 0.07604766
[21] 0.07449047 0.07309754 0.07194110 0.07098100 0.07018392
[26] 0.06952091 0.06887432 0.06830444 0.06782869 0.06743077
[31] 0.06707446 0.06675856 0.06649647 0.06627891 0.06609828
[36] 0.06594833 0.06579625 0.06565543 0.06553841 0.06544197
[41] 0.06536123 0.06529218 0.06523639 0.06518931 0.06515058
[46] 0.06511806 0.06509062 0.06505655 0.06502858 0.06500546
[51] 0.06498445 0.06495657 0.06492196 0.06489168 0.06486665
[56] 0.06484615 0.06482917 0.06481507 0.06480312 0.06479365
[61] 0.06478540 0.06477871 0.06477295 0.06476919 0.06476505
[66] 0.06476156 0.06475901 0.06475670 0.06475470 0.06475300
[71] 0.06475183 0.06475049 0.06474963 0.06474884
mse_test1
[1] 0.1963519 0.1866676 0.1787868 0.1723894 0.1672105 0.1630314
[7] 0.1596718 0.1569827 0.1548414 0.1531467 0.1518155 0.1507794
[13] 0.1499820 0.1493773 0.1489275 0.1480109 0.1471125 0.1463090
[19] 0.1450664 0.1441028 0.1432008 0.1423409 0.1416688 0.1411489
[25] 0.1407521 0.1404541 0.1401995 0.1399675 0.1398037 0.1397067
[31] 0.1396211 0.1395406 0.1394979 0.1394826 0.1394880 0.1395091
[37] 0.1394704 0.1394028 0.1393537 0.1393220 0.1392987 0.1392808
[43] 0.1392813 0.1392815 0.1392885 0.1392955 0.1392985 0.1392217
[49] 0.1391593 0.1391064 0.1390709 0.1391076 0.1391526 0.1391682
[55] 0.1391818 0.1391983 0.1392159 0.1392337 0.1392502 0.1392675
[61] 0.1392842 0.1392995 0.1393142 0.1393305 0.1393470 0.1393587
[67] 0.1393693 0.1393810 0.1393914 0.1394004 0.1394073 0.1394171
[73] 0.1394228 0.1394295
min(mse_train1)
[1] 0.06474884
min(mse_test1)
[1] 0.1390709
lambda_min_mse_train1<- lmodel2$lambda[which.min(mse_train1)]
lambda_min_mse_test1 <-lmodel2$lambda[which.min(mse_test1)]
lambda_min_mse_train1
[1] 0.0002943019
lambda_min_mse_test1
[1] 0.002500841
Check the coefficients for the f2 and f3 models
f2coef<-coef(lmodel, s = lambda_min_mse_test)
f3coef<-coef(lmodel2, s = lambda_min_mse_test1)
f2coef
22 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.240517e+00
month 1.193989e-02
type_conventional -4.315907e-01
type_organic 1.037824e-11
TotalVolume .
PLU4046 -9.803572e-09
PLU4770 1.862104e-10
PLU4225 .
SmallBags .
LargeBags -3.217985e-08
XLargeBags 1.559099e-07
Area_NewEngland 2.244133e-01
Area_Southeast -2.055361e-02
Area_Mideast 2.041316e-01
Area_RockyMountain -1.001925e-01
Area_FarWest 2.251057e-02
Area_GreatLakes -6.889180e-03
Area_Southwest -1.465750e-01
Area_Plains 3.874623e-02
Area_TotalUS .
NewPrice 9.836736e-02
NewPrice2 6.594030e-02
f3coef
16 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.512389e+00
month 1.454115e-02
type_conventional -5.109947e-01
type_organic 1.145366e-11
PLU4046 -9.401350e-09
PLU4770 1.431715e-09
PLU4225 .
LargeBags -4.939522e-08
XLargeBags 2.314856e-07
Area_NewEngland 2.107071e-01
Area_Southeast -4.740335e-02
Area_Mideast 1.790446e-01
Area_RockyMountain -1.423959e-01
Area_GreatLakes -4.033916e-02
Area_Southwest -1.797424e-01
Area_Plains 9.054443e-03
Since the second model's test MSE increased (0.1390709 vs 0.1240429), the better model is still the first "lmodel", so we use it to run the prediction.
y1_avo_min_lambda_hat <- predict(lmodel, s = lambda_min_mse_test, newx = x1_avo)
class(y1_avo_min_lambda_hat)
[1] "matrix"
Aggregate data into one dataframe for the first model prediction
df2<-avo %>%
select(Date, AveragePrice)
df3<-cbind(df2, y1_avo_min_lambda_hat)
colnames(df3)
[1] "Date" "AveragePrice" "1"
names(df3)[3]<-"AveragePrice_hat"
Plot the actual average price and the predictive average price
head(df3)
Date AveragePrice AveragePrice_hat
1 2015-01-04 1.22 1.0452352
2 2015-01-04 1.00 0.9157808
3 2015-01-04 1.08 1.2031624
4 2015-01-04 1.01 0.8923604
5 2015-01-04 1.02 1.2157677
6 2015-01-04 1.40 1.1919005
class(df3$Date)
[1] "Date"
df3 %>%
group_by(Date) %>%
summarise(meanpriced = mean(AveragePrice),meanpre = mean(AveragePrice_hat))%>%
ggplot()+
geom_line(mapping = aes(x=Date,
y=meanpriced), col = "#356211")+
geom_line(mapping = aes(x=Date, y= meanpre), col = "#cda989")+
geom_vline(xintercept=as.numeric(as.Date("2017-03-01")), col = "#AA471F", linetype = "dashed", size = 1)+
labs(title = "Lasso with new variables",
y = "Mean Price") +
theme_classic()
We can see from this graph that, compared with the regular lasso we ran before, this model produces a much better prediction: it captures not only the season-based change in avocado prices but also some variation within each year. However, although this model has the lowest test MSE so far, its plotted prediction is visually not as close as some of our earlier models, so further work on this time-series dataset is needed for a more accurate prediction.
Because we are working with a time-series dataset, it is reasonable to assume that the previous day's price plays a big part in predicting the next day's price. Therefore, we tried to write a loop that predicts one day's price and uses it as a predictor for the next day's price.
Let's make a new dataframe with the 1-day-lagged price as an additional variable, and set it to zero for the test data (since in real life we would not know the previous day's price):
avo_ts <- avo %>%
mutate(lag1_y = lag(AveragePrice))
dim(avo_ts)
[1] 18249 28
avo_ts <- drop_na(avo_ts)
dim(avo_ts)
[1] 18248 28
avo_train_ts <- avo_ts[1:12202, ]
avo_test_ts <- avo_ts[12203:18248, ]
dim(avo_test_ts)
[1] 6046 28
avo_test_ts[2:6046,"lag1_y"] <- 0
Running a loop to predict each day using previous day prediction:
f_ts <- as.formula(AveragePrice ~ lag1_y + month + type_conventional + type_organic +
PLU4046 + PLU4770 + PLU4225 + LargeBags + XLargeBags +
Area_NewEngland + Area_Southeast + Area_Mideast +
Area_RockyMountain + Area_GreatLakes +
Area_Southwest + Area_Plains)
y_hat_ts_test_f <- avo_test_ts[1,"lag1_y"]
avo_ts1 <- avo_ts
lambda_min_test <- 0.001570601 #the min lambda from lasso model above
check_ts <- vector()
for (i in 1:6045) {
# grow the training window by one row and shrink the test window
avo_train_ts <- avo_ts1[1:(12203+i), ]
avo_test_ts <- avo_ts1[(12204+i):18248, ]
x_ts_train <- model.matrix(f_ts, avo_train_ts)[,-1]
x_ts_test <- model.matrix(f_ts, avo_test_ts)[,-1]
y_ts_train <- avo_train_ts$AveragePrice
y_ts_test <- avo_test_ts$AveragePrice
# refit the lasso on the expanded training window
lmodel_ts <- glmnet(x_ts_train, y_ts_train, alpha = 1, nlambda = 100)
y_ts_train_hat <- predict(lmodel_ts, s = lambda_min_test, newx = x_ts_train)
y_ts_test_hat <- predict(lmodel_ts, s = lambda_min_test, newx = x_ts_test)
# keep the one-step-ahead prediction for the first test row
y_hat_ts_test_f <- c(y_hat_ts_test_f, y_ts_test_hat[1])
(length(y_hat_ts_test_f)) # print progress
# feed the prediction forward as the next row's lagged price
avo_ts1[(12204+i), "lag1_y"] <- y_ts_test_hat[1]
}
Error in cbind2(1, newx) %*% nbeta :
Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
The loop returns an error which, unfortunately, we could not resolve (partly because the loop takes a long time to run).
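One plausible cause (our assumption; we did not verify it on the full run): once the shrinking test window is down to a single row, subsetting the model matrix with `[, -1]` silently drops it to a plain vector, and `predict.glmnet()` rejects a vector `newx` with exactly this kind of Cholmod dimension error. Adding `drop = FALSE` preserves the matrix shape:

```r
m <- matrix(1:6, nrow = 1)    # stands in for a one-row model matrix
dim(m[, -1])                  # NULL: the subset collapsed to a vector
dim(m[, -1, drop = FALSE])    # 1 5: still a matrix, safe for predict()

# so inside the loop the safer subsetting would be, e.g.:
# x_ts_test <- model.matrix(f_ts, avo_test_ts)[, -1, drop = FALSE]
```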
However, the loop did manage to gather predictions based on the previous day's prediction as a variable, as we can see here:
length(y_hat_ts_test_f)
[1] 6044
y_hat_ts_test_f <- unlist(y_hat_ts_test_f)
It is supposed to contain 6,046 numbers, but, perhaps because of the error, two numbers were lost somewhere in the loop. We assume that 2 out of 6,046 will not make a significant difference, so in order to calculate the MSE properly we substitute the two missing numbers with the average of the 6,044 known ones:
y_hat_ts_test_f <- c(y_hat_ts_test_f, mean(y_hat_ts_test_f), mean(y_hat_ts_test_f))
length(y_hat_ts_test_f)
[1] 6046
We also need to readjust the test data, because the loop modified it:
avo_test_ts <- avo_ts[12203:18248,]
y_ts_test <- avo_test_ts$AveragePrice
We can now calculate the MSE:
mse_ts_test_f <- mean((y_ts_test - y_hat_ts_test_f)^2)
print(mse_ts_test_f)
[1] 0.1113422
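To put this number on the price scale, the RMSE (the square root of the MSE) is easier to interpret: a typical prediction misses the true average price by roughly $0.33.

```r
rmse_ts_test_f <- sqrt(0.1113422)   # MSE printed above
round(rmse_ts_test_f, 3)            # about 0.334 dollars
```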
The MSE shown above is the lowest among all the models we have run. This may imply that, for a time series dataset, it is better to use the previous day's prediction as a variable when predicting the next day. However, it is quite inconvenient and time-consuming to adjust the model's variables and predict each row separately.
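A cheaper alternative, if the per-row refitting is the bottleneck, is to fit the model once on the training window and then roll the predictions forward, feeding each prediction into the next row's lag. The sketch below only illustrates the idea on simulated data, with plain `lm()` standing in for our lasso (the AR(1) series and all variable names here are made up for illustration):

```r
set.seed(810)
# toy AR(1)-style price series standing in for the avocado data
n <- 200
price <- numeric(n); price[1] <- 1.3
for (t in 2:n) price[t] <- 0.7 * price[t - 1] + 0.4 + rnorm(1, sd = 0.05)
toy <- data.frame(price = price, lag1_y = c(NA, price[-n]))
toy <- na.omit(toy)

train <- toy[1:150, ]
test  <- toy[151:nrow(toy), ]

fit <- lm(price ~ lag1_y, data = train)   # fit once, not once per row

# roll forward: each prediction becomes the next step's lag
preds <- numeric(nrow(test))
lag_val <- train$price[nrow(train)]       # last known price
for (i in seq_len(nrow(test))) {
  preds[i] <- predict(fit, newdata = data.frame(lag1_y = lag_val))
  lag_val <- preds[i]
}
mean((test$price - preds)^2)              # test MSE of the rolled forecast
```

The same roll-forward loop would work with `predict(lmodel_ts, s = lambda_min_test, newx = ...)` in place of `predict(fit, ...)`, at the cost of rebuilding a one-row model matrix each step.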
Let's see whether a plot can show the difference.
First, we need to select and prepare the data for the plot:
avo %>%
select(Date, AveragePrice) -> plot_ts_full
plot_ts_full <- plot_ts_full[2:nrow(plot_ts_full),]
y_hat_train_ts <- y_ts_train_hat[1:12202,]
y_hat_ts_full <- c(y_hat_train_ts, y_hat_ts_test_f )
plot_ts <- cbind(plot_ts_full, y_hat_ts_full)
Plot:
plot_ts %>%
group_by(Date) %>%
summarise(mean_y = mean(AveragePrice),
mean_yhat = mean(y_hat_ts_full)) %>%
ggplot() +
geom_line(aes(x = Date, y = mean_y), col = "#356211") +
geom_line(aes(x = Date, y = mean_yhat), col = "#cda989") -> ts_graph
ts_graph + geom_vline(aes(xintercept = as.numeric(Date[113])),
linetype = "dashed", size = 1,
color = "#AA471F") +
labs(title = "Lasso with Loop",
y = "Mean Price")
The graph does not show much of a difference; however, we know this model is better because it has the lower MSE.